fix: null in array_agg with DISTINCT and IGNORE#19736
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes a bug where the IGNORE NULLS clause was being lost when optimizing ARRAY_AGG(DISTINCT x IGNORE NULLS) queries. The SingleDistinctToGroupBy optimizer was incorrectly discarding the null_treatment, filter, and order_by parameters when rewriting DISTINCT aggregates into GROUP BY operations.
Changes:
- Modified the optimizer to preserve aggregate function parameters (
null_treatment,filter,order_by) during the DISTINCT-to-GROUP-BY transformation - Added a regression test to verify
ARRAY_AGG(DISTINCT x IGNORE NULLS)correctly filters NULL values
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| datafusion/optimizer/src/single_distinct_to_groupby.rs | Extracts and preserves filter, order_by, and null_treatment parameters when rewriting DISTINCT aggregates |
| datafusion/sqllogictest/test_files/aggregate.slt | Adds regression test for ARRAY_AGG(DISTINCT ... IGNORE NULLS) functionality |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Jefffrey
left a comment
There was a problem hiding this comment.
Nice spot. I notice in the branch below it also doesn't carry over the properties, is this something we should also fix?
datafusion/datafusion/optimizer/src/single_distinct_to_groupby.rs
Lines 212 to 234 in 458b491
Jefffrey
left a comment
There was a problem hiding this comment.
Is it possible to find a test case for the 2-phase rewrite cases? 🤔
| func, | ||
| vec![col(&alias_str)], | ||
| false, | ||
| None, |
There was a problem hiding this comment.
There's another case here
There was a problem hiding this comment.
I’ll look into whether we can add a minimal test case covering the 2-phase rewrite scenario.
If I understood correctly, I’ve added a couple of tests covering those cases. |
|
Looking into it some more, it looks like the two-phase aggregation related changes aren't really needed as they don't affect anything 🤔 The introduced tests don't fail on main, and it seems its because the only supported aggregates for the two-phase aggregation branch are min/max/sum: datafusion/datafusion/optimizer/src/single_distinct_to_groupby.rs Lines 85 to 94 in f9697c1
I guess it doesn't hurt to keep the fix but they won't actually affect anything (since datafusion/datafusion/optimizer/src/single_distinct_to_groupby.rs Lines 81 to 83 in f9697c1 |
|
Thanks @davidlghellin |
Which issue does this PR close?
DISTINCTandIGNORE NULLS#19735.Rationale for this change
The
SingleDistinctToGroupByoptimizer rewrites aggregate functions withDISTINCTinto aGROUP BYoperation for better performance. However, during this rewrite, it was discarding important aggregate function parameters:null_treatment,filter, andorder_by.This caused queries like
ARRAY_AGG(DISTINCT x IGNORE NULLS)to include NULL values in the result because theIGNORE NULLSclause (stored as null_treatment) was being lost during optimization.What changes are included in this PR?
Preserve aggregate parameters in optimizer: Modified
SingleDistinctToGroupByto extract and preservenull_treatment,filter, andorder_byfrom the original aggregate function when creating the rewritten version.Add regression test: Added SQL logic test to verify that
ARRAY_AGG(DISTINCT x IGNORE NULLS)correctly filters out NULL values.Files changed:
datafusion/optimizer/src/single_distinct_to_groupby.rs: Extract and pass through filter, order_by, and null_treatment parameters
datafusion/sqllogictest/test_files/aggregate.slt: Add test case for ARRAY_AGG(DISTINCT ... IGNORE NULLS)
Are these changes tested?
Yes:
New SQL logic test in aggregate.slt verifies the fix works end-to-end
Existing optimizer tests continue to pass (19 tests in
single_distinct_to_groupby)Existing aggregate tests continue to pass (20 tests in
array_agg)Are there any user-facing changes?
Bug fix - Users can now correctly use
IGNORE NULLS(andFILTER/ORDER BY) withDISTINCTaggregates:Before (broken):
After (fixed):